
New slurm customization parameters (account, containers)#1209

Merged
Kipok merged 17 commits into main from igitman/account-arg on Feb 27, 2026
Conversation

@Kipok (Collaborator) commented Feb 3, 2026

Summary by CodeRabbit

  • New Features

    • Added a global --account option to specify a Slurm account for job submissions.
    • Added container override options (--main-container, --sandbox-container, --judge-container, --judge-server-container, --container) to select non-default images; these overrides propagate across all job/task creation flows.
  • Tests

    • Updated generation tests to accept the new account parameter.
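
The override-or-default resolution described in the feature summary above can be sketched as follows; the function and config names here are illustrative, not the actual nemo_skills API:

```python
from typing import Optional


def resolve_container(override: Optional[str], cluster_config: dict, default_key: str) -> str:
    """Prefer an explicit CLI override; otherwise fall back to the cluster default image."""
    if override is not None:
        return override
    return cluster_config["containers"][default_key]


cluster_config = {"containers": {"nemo-skills": "nvcr.io/example/nemo-skills:latest"}}
resolve_container(None, cluster_config, "nemo-skills")       # falls back to the default
resolve_container("my-registry/custom:1.0", cluster_config, "nemo-skills")  # override wins
```

The same pattern applies to each of the new per-component options (--main-container, --sandbox-container, etc.), just with a different default key per component.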

Kipok added 2 commits February 3, 2026 11:52
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@greptile-apps bot (Contributor) left a comment

No files reviewed, no comments
@coderabbitai bot (Contributor) commented Feb 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds optional Slurm account and per-component container-image override CLI options across multiple pipeline commands and threads them through task creation, HardwareConfig, executor creation, and sbatch submission so tasks run with specified account/container choices while falling back to existing defaults.

Changes

  • Convert CLI (nemo_skills/pipeline/convert.py)
    Added account and container CLI options; resolves container as container or container_map[(convert_from, convert_to)] and passes account into add_task.
  • Eval CLI & Judge Tasks (nemo_skills/pipeline/eval.py)
    Added account, main_container, sandbox_container, judge_container, and judge_server_container options; threaded into _create_llm_judge_tasks and downstream task creation so main/judge/server tasks receive container and account overrides with fallbacks.
  • Generate pipeline (nemo_skills/pipeline/generate.py)
    Extended _create_job_unified and the generate CLI with account, main_container, and sandbox_container; client and sandbox commands prefer the overrides, and HardwareConfig is now populated with account.
  • Run command & Start server (nemo_skills/pipeline/run_cmd.py, nemo_skills/pipeline/start_server.py)
    Added account and sandbox/main container CLI options; launch_server and run_cmd resolve the container via override or defaults and pass account and sandbox_container into add_task and the server launch.
  • Evaluator context & hardware (nemo_skills/pipeline/nemo_evaluator.py)
    Added account to the evaluator CLI and _TaskCreationContext; _hardware_for_group accepts account and includes it in the HardwareConfig used for sbatch kwargs.
  • Declarative & Exec plumbing (nemo_skills/pipeline/utils/declarative.py, nemo_skills/pipeline/utils/exp.py)
    Added an account field to HardwareConfig; extended the get_executor and add_task signatures to accept account and sandbox_container, resolving account with a fallback to the cluster config and passing it into executor/sbatch kwargs.
  • Tests & Inference tweak (tests/test_generation.py, nemo_skills/inference/model/tool_call.py)
    Updated the test to pass the new account=None argument to _create_job_unified; generate_async now decrements tokens_to_generate by the number of produced tokens when it is an integer.
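
The account-threading described above can be sketched in miniature; the class and function below are illustrative stand-ins, not the real nemo_skills HardwareConfig or executor code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HardwareConfig:
    partition: Optional[str] = None
    account: Optional[str] = None
    num_gpus: Optional[int] = None
    num_nodes: int = 1


def sbatch_kwargs_from(hw: HardwareConfig, cluster_config: dict) -> dict:
    """Build sbatch kwargs; a CLI-provided account wins over the cluster-config default."""
    kwargs = {}
    account = hw.account or cluster_config.get("account")
    if account:
        kwargs["account"] = account
    if hw.partition:
        kwargs["partition"] = hw.partition
    return kwargs


hw = HardwareConfig(partition="batch", account="team-a")
sbatch_kwargs_from(hw, {"account": "default-acct"})  # account override takes precedence
```

The fallback step is the key point: jobs keep working on clusters that define a default account, while --account lets users redirect billing explicitly.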

Sequence Diagram(s)

sequenceDiagram
    participant CLI as "User CLI"
    participant Pipeline as "Pipeline (convert/generate/eval/run_cmd/start_server)"
    participant AddTask as "add_task"
    participant Executor as "get_executor"
    participant Slurm as "Slurm/sbatch"

    CLI->>Pipeline: invoke command with account & container overrides
    Pipeline->>AddTask: build task params (container = override or default, account)
    AddTask->>Executor: request executor(container, account, hardware...)
    Executor->>Slurm: submit job (sbatch kwargs include account)
    Slurm-->>Executor: job id
    Executor-->>AddTask: executor handle
    AddTask-->>Pipeline: task registered
    Pipeline-->>CLI: return task info

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • activatedgeek
  • Kipok
  • i-vainn
🚥 Pre-merge checks | ✅ 3 passed
  • Description check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title check: ✅ Passed. The title 'New slurm customization parameters (account, containers)' accurately describes the main change: adding new CLI options for Slurm account and container overrides across multiple pipeline modules.
  • Docstring coverage: ✅ Passed. Coverage is 93.33%, above the required 80.00% threshold.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_skills/pipeline/nemo_evaluator.py (1)

560-590: ⚠️ Potential issue | 🟡 Minor

Don’t silently ignore exclusive.

The parameter is accepted and threaded through, but never applied. Either honor it via sbatch_kwargs or fail fast when it’s set so users don’t think they’re getting exclusive nodes.

Suggested fail-fast guard
 def _hardware_for_group(
     partition: Optional[str],
     account: Optional[str],
     num_gpus: Optional[int],
     num_nodes: int,
     qos: Optional[str],
     exclusive: bool,
 ) -> HardwareConfig:
+    if exclusive:
+        raise ValueError("exclusive is not supported for nemo_evaluator jobs yet; remove --exclusive.")
     return HardwareConfig(
         partition=partition,
         account=account,
         num_gpus=num_gpus,
         num_nodes=num_nodes,

As per coding guidelines, avoid silently ignoring unused user-passed parameters. The code should fail if a user specifies an unsupported argument or if a required argument is not provided.

nemo_skills/pipeline/eval.py (1)

816-866: ⚠️ Potential issue | 🟠 Major

Account override is missing for summarize/compute-score tasks.
When a user specifies --account, these Slurm tasks still run under the default account and can fail on clusters without a default. Please propagate account=account in both add_task calls.

🔧 Proposed fix
                 summarize_task = pipeline_utils.add_task(
                     exp,
                     cmd=command,
                     task_name=f"{expname}-{benchmark}-summarize-results",
                     log_dir=f"{output_dir}/{benchmark_args.eval_subfolder}/summarized-results",
                     container=cluster_config["containers"]["nemo-skills"],
                     cluster_config=cluster_config,
+                    account=account,
                     run_after=run_after,
                     reuse_code_exp=reuse_code_exp,
                     reuse_code=reuse_code,
                     task_dependencies=(
                         dependent_tasks if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
                     ),
                     installation_command=installation_command,
                     skip_hf_home_check=skip_hf_home_check,
                     sbatch_kwargs=sbatch_kwargs,
                 )
@@
                 score_task = pipeline_utils.add_task(
                     exp,
                     cmd=command,
                     task_name=f"{expname}-{group}-compute-score",
                     log_dir=f"{output_dir}/eval-results/{group}/compute-score-logs",
                     container=cluster_config["containers"]["nemo-skills"],
                     cluster_config=cluster_config,
+                    account=account,
                     run_after=run_after,
                     reuse_code_exp=reuse_code_exp,
                     reuse_code=reuse_code,
                     task_dependencies=(
                         group_tasks[group] if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies
                     ),
                     installation_command=installation_command,
                     skip_hf_home_check=skip_hf_home_check,
                     sbatch_kwargs=sbatch_kwargs,
                 )

As per coding guidelines, avoid silently ignoring unused user-passed parameters: the code should fail if a user specifies an unsupported argument or omits a required one. Use dataclasses or explicit **kwargs handling to enforce this automatically.
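
The guideline quoted above can be illustrated with a small sketch; all names here are hypothetical, not the project's real API. A dataclass defines the accepted parameters, and unknown keyword arguments fail fast instead of being silently dropped:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TaskParams:
    cmd: str
    account: Optional[str] = None
    container: Optional[str] = None


def make_task(**kwargs) -> TaskParams:
    """Reject any keyword argument the task does not actually support."""
    allowed = {f.name for f in fields(TaskParams)}
    unknown = set(kwargs) - allowed
    if unknown:
        raise ValueError(f"Unsupported task parameters: {sorted(unknown)}")
    return TaskParams(**kwargs)
```

With this shape, a user passing an unhandled option (like the exclusive example above) gets an immediate error rather than a job that quietly ignores their intent.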

@gwarmstrong (Collaborator) left a comment
In general this looks good. I have a minor comment about future goals for this, but I don't think it requires action.

Comment on lines +367 to +369
main_container: str = typer.Option(None, help="Override container image for the main evaluation client"),
sandbox_container: str = typer.Option(None, help="Override container image for the sandbox"),
judge_container: str = typer.Option(None, help="Override container image for GPU-based judges (comet, nvembed)"),
@gwarmstrong commented Feb 3, 2026
I think it's a little bulky to have separate override arguments for each container everywhere, though I'm not sure there is a better solution. If we wanted overrides like we have for tools, e.g.,

++container_overrides.sandbox = "..."
++container_overrides.judge = "..."

then the choice of key would be unclear, since our "job components" (judge, main, sandbox, ...) don't map cleanly to container names (e.g., "judge" -> containers[judge_server_type], main -> containers["nemo-skills"], sandbox -> containers["sandbox"]).

So with the current structure I think what you've done is the best choice, but maybe we can eventually work toward something a little more general here.
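
A rough sketch of the more general container_overrides idea floated here, with the component-to-key mapping made explicit (all names hypothetical; this is not existing nemo_skills code):

```python
def resolve_containers(cluster_config: dict, overrides: dict, judge_server_type: str) -> dict:
    """Resolve one image per job component, preferring user overrides.

    The component -> config-key table is exactly the ambiguity noted above:
    components do not map 1:1 to container names, so the mapping must live
    somewhere explicit.
    """
    component_to_key = {
        "main": "nemo-skills",
        "sandbox": "sandbox",
        "judge": judge_server_type,
    }
    return {
        component: overrides.get(component) or cluster_config["containers"][key]
        for component, key in component_to_key.items()
    }


cfg = {"containers": {"nemo-skills": "img-a", "sandbox": "img-b", "vllm": "img-c"}}
resolve_containers(cfg, {"sandbox": "img-x"}, judge_server_type="vllm")
```

The trade-off is visible in the table: a single override namespace is terser at the CLI, but the mapping logic has to be maintained centrally instead of at each option's call site.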

Kipok and others added 5 commits February 3, 2026 19:34
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>

# Conflicts:
#	nemo_skills/pipeline/start_server.py
@coderabbitai bot (Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

807-858: ⚠️ Potential issue | 🟠 Major

account not forwarded to summarize/score tasks — can cause job rejection on enforced-accounting clusters.

The account parameter accepted by eval is propagated to main eval tasks and judge tasks, but the summarize_results task (line 807) and compute_group_score task (line 841) silently omit it. On Slurm clusters that require account specification for every job submission, these tasks will fail or be billed to the wrong account.

🐛 Proposed fix
             summarize_task = pipeline_utils.add_task(
                 exp,
                 cmd=command,
                 task_name=f"{expname}-{benchmark}-summarize-results",
                 log_dir=f"{output_dir}/{benchmark_args.eval_subfolder}/summarized-results",
                 container=cluster_config["containers"]["nemo-skills"],
                 cluster_config=cluster_config,
+                account=account,
                 run_after=run_after,
             score_task = pipeline_utils.add_task(
                 exp,
                 cmd=command,
                 task_name=f"{expname}-{group}-compute-score",
                 log_dir=f"{output_dir}/eval-results/{group}/compute-score-logs",
                 container=cluster_config["containers"]["nemo-skills"],
                 cluster_config=cluster_config,
+                account=account,
                 run_after=run_after,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 807 - 858, The summarize_results
and compute_group_score tasks omit the account setting; when creating
summarize_task and score_task via pipeline_utils.add_task (the calls that create
summarize_task and score_task), forward the account parameter (e.g.,
account=account) so the job runs under the correct Slurm account; update both
add_task invocations (the summarize_task and score_task calls) to include
account=account (or propagate account from the surrounding eval function/args)
and ensure any sbatch_kwargs/account merging logic remains consistent.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a89b68 and 3b64aa7.

📒 Files selected for processing (3)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/start_server.py
  • nemo_skills/pipeline/utils/exp.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/utils/exp.py

Signed-off-by: George Armstrong <georgea@nvidia.com>

# Conflicts:
#	nemo_skills/pipeline/eval.py
@coderabbitai bot (Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

623-674: ⚠️ Potential issue | 🟠 Major

account is silently dropped for summarize_results and compute_group_score tasks.

Both add_task calls (lines 623 and 657) are missing account=account, so a user-specified Slurm account is ignored for these tasks while being respected everywhere else. As per coding guidelines, "avoid silently ignoring user-passed parameters."

🐛 Proposed fix
             summarize_task = pipeline_utils.add_task(
                 exp,
                 cmd=command,
                 task_name=f"{expname}-{benchmark}-summarize-results",
                 log_dir=f"{output_dir}/{benchmark_args.eval_subfolder}/summarized-results",
                 container=cluster_config["containers"]["nemo-skills"],
                 cluster_config=cluster_config,
+                account=account,
                 run_after=run_after,
                 ...
             )
             score_task = pipeline_utils.add_task(
                 exp,
                 cmd=command,
                 task_name=f"{expname}-{group}-compute-score",
                 log_dir=f"{output_dir}/eval-results/{group}/compute-score-logs",
                 container=cluster_config["containers"]["nemo-skills"],
                 cluster_config=cluster_config,
+                account=account,
                 run_after=run_after,
                 ...
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 623 - 674, The summarize_results
and compute_group_score tasks created via pipeline_utils.add_task (referenced as
summarize_task and score_task) are missing the account parameter so a
user-specified Slurm account is ignored; fix by passing account=account into
both add_task calls that create summarize_task and score_task (the two
pipeline_utils.add_task invocations building the summarize-results and
compute-score tasks) so the Slurm account is honored like other tasks.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b64aa7 and bbf2ea7.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/eval.py

The test calls _create_job_unified() which now requires account as a
positional argument after the addition of the --account CLI option.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@coderabbitai bot (Contributor) left a comment

🧹 Nitpick comments (1)
tests/test_generation.py (1)

176-188: Exercise the non-default account path in this test.

Line 184 currently passes account=None, so this only validates the default path and does not verify that user-provided account values are threaded into the generated job metadata/command. Consider using a sentinel account (e.g., "test-account") and asserting it propagates to the expected output object/args.

As per coding guidelines, "Avoid silently ignoring user-passed parameters; fail if a required parameter is not specified or an unsupported parameter is provided."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_generation.py` around lines 176 - 188, The test currently only
exercises the default account path because _create_job_unified is called with
account=None; change the test to pass a sentinel account string (e.g.,
"test-account") to the account parameter when calling _create_job_unified and
add an assertion that this value is propagated into the returned job
metadata/command (inspect the output object(s) in the test and assert the
account field or command arg equals "test-account"); update any related
assertions that assumed None/default to reflect the explicit account so the
non-default path is validated.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bbf2ea7 and dd0d94e.

📒 Files selected for processing (1)
  • tests/test_generation.py

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok enabled auto-merge (squash) on February 26, 2026 02:05
Per-request inference timeout (120s) and pytest-level test timeout (300s)
for test_eval_gsm8k_api and test_eval_judge_api. Prevents external API
hangs from blocking CI for 8+ minutes.

Signed-off-by: George Armstrong <georgea@nvidia.com>
The judge step in test_eval_judge_api runs as a separate nemo-run job
and doesn't inherit ++inference.timeout from the main generation step.
Pass it via --extra_judge_args to prevent judge hangs too.

Signed-off-by: George Armstrong <georgea@nvidia.com>
Root cause: litellm max_retries=3 (default) compounds with
inference.timeout — a single hanging request can take up to
timeout * (max_retries + 1) = 120s * 4 = 480s, exceeding the 300s
pytest timeout. Setting max_retries=0 ensures a timeout fails
immediately without silent retries.

Signed-off-by: George Armstrong <georgea@nvidia.com>
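
The worst-case arithmetic from the commit message above, as a tiny sketch:

```python
def worst_case_seconds(timeout: float, max_retries: int) -> float:
    # One initial attempt plus max_retries retries, each hanging for up
    # to `timeout` seconds before the client gives up.
    return timeout * (max_retries + 1)


worst_case_seconds(120, 3)  # with default retries: exceeds the 300s pytest timeout
worst_case_seconds(120, 0)  # with max_retries=0: fails fast within the limit
```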
…l name

Signed-off-by: George Armstrong <georgea@nvidia.com>
@Kipok merged commit c8abe5d into main on Feb 27, 2026
5 checks passed
@Kipok deleted the igitman/account-arg branch on February 27, 2026 17:31
gwarmstrong added a commit that referenced this pull request Mar 4, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
@coderabbitai bot mentioned this pull request on Mar 5, 2026